Spatial Crowdsourcing-based Task Allocation for UAV-assisted Maritime Data Collection

Han, Xiaoling, Lin, Bin, Na, Zhenyu, Li, Bowen, Zhang, Chaoyue, Zhang, Ran

arXiv.org Artificial Intelligence

Abstract--Driven by the unceasing development of maritime services, tasks of unmanned aerial vehicle (UAV)-assisted maritime data collection (MDC) are becoming increasingly diverse, complex, and personalized. As a result, effective task allocation for MDC is becoming increasingly critical. In this work, integrating the concept of spatial crowdsourcing (SC), we develop an SC-based MDC network model and investigate the task allocation problem for UAV-assisted MDC. In variable maritime service scenarios, tasks are allocated to UAVs based on the spatial and temporal requirements of the tasks, as well as the mobility of the UAVs. To address this problem, we design an SC-based task allocation algorithm for the MDC (SC-MDC-TA). Quality estimation is utilized to assess and regulate task execution quality by evaluating the signal-to-interference-plus-noise ratio and the UAV energy consumption. A reverse auction is employed to reduce the task waiting time as much as possible while ensuring timely completion. Additionally, we establish typical task allocation scenarios based on maritime service requirements indicated by electronic navigational charts. Simulation results demonstrate that the proposed SC-MDC-TA algorithm effectively allocates tasks for various MDC scenarios. Furthermore, compared to the benchmark, the SC-MDC-TA algorithm can also reduce the task completion time and lower the UAV energy consumption.
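The abstract pairs a feasibility check (deadline and energy) with a reverse auction in which UAVs bid for tasks and the cheapest feasible bid wins. The following is a minimal illustrative sketch of that idea, not the paper's SC-MDC-TA algorithm: the `Task`/`UAV` fields, units, and the `energy_per_km` cost model are invented for the example.

```python
import math
from dataclasses import dataclass

@dataclass
class Task:
    x: float          # task location (abstract distance units)
    y: float
    deadline: float   # latest acceptable completion time

@dataclass
class UAV:
    x: float          # current UAV position
    y: float
    speed: float      # distance units per time unit
    energy: float     # remaining energy budget

def bid(uav: UAV, task: Task, energy_per_unit: float = 50.0):
    """A UAV's bid: its estimated completion time, or None if infeasible."""
    dist = math.hypot(uav.x - task.x, uav.y - task.y)
    t = dist / uav.speed
    cost = dist * energy_per_unit
    if t > task.deadline or cost > uav.energy:
        return None               # cannot meet the deadline or energy budget
    return t

def reverse_auction(uavs, task):
    """Winner is the feasible bidder with the lowest completion time."""
    bids = [(bid(u, task), u) for u in uavs]
    bids = [(t, u) for t, u in bids if t is not None]
    return min(bids, key=lambda b: b[0]) if bids else None

uavs = [UAV(0, 0, 10, 5000), UAV(30, 40, 20, 10000)]
task = Task(3, 4, deadline=10)
winner = reverse_auction(uavs, task)   # nearer UAV wins with t = 0.5
```

Because each UAV bids its own completion time, the auction naturally minimizes waiting time among feasible bidders; infeasible UAVs simply abstain.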



VS-LLM: Visual-Semantic Depression Assessment based on LLM for Drawing Projection Test

Wu, Meiqi, Kang, Yaxuan, Li, Xuchen, Hu, Shiyu, Chen, Xiaotang, Kang, Yunfeng, Wang, Weiqiang, Huang, Kaiqi

arXiv.org Artificial Intelligence

The Drawing Projection Test (DPT) is an essential tool in art therapy, allowing psychologists to assess participants' mental states through their sketches. Specifically, through sketches with the theme of "a person picking an apple from a tree (PPAT)", it can be revealed whether the participants are in mental states such as depression. Compared with scales, the DPT can enrich psychologists' understanding of an individual's mental state. However, the interpretation of the PPAT is laborious and depends on the experience of the psychologists. To address this issue, we propose an effective identification method to support psychologists in conducting a large-scale automatic DPT. Unlike traditional sketch recognition, the DPT focuses more on the overall evaluation of the sketches, such as color usage and space utilization. Moreover, the PPAT imposes a time limit and prohibits verbal reminders, resulting in low drawing accuracy and a lack of detailed depiction. To address these challenges, we make the following efforts: (1) providing an experimental environment for automated analysis of PPAT sketches for depression assessment; (2) offering a Visual-Semantic depression assessment based on LLM (VS-LLM) method; (3) experimental results demonstrate that our method improves by 17.6% compared to the psychologist assessment method. We anticipate that this work will contribute to research in mental state assessment based on the recognition of elements in PPAT sketches. Our datasets and codes are available at https://github.com/wmeiqi/VS-LLM.


StableMotion: Repurposing Diffusion-Based Image Priors for Motion Estimation

Wang, Ziyi, Li, Haipeng, Sui, Lin, Zhou, Tianhao, Jiang, Hai, Nie, Lang, Liu, Shuaicheng

arXiv.org Artificial Intelligence

We present StableMotion, a novel framework that leverages knowledge (geometry and content priors) from pretrained large-scale image diffusion models to perform motion estimation, solving single-image-based image rectification tasks such as Stitched Image Rectangling (SIR) and Rolling Shutter Correction (RSC). Specifically, the StableMotion framework takes a text-to-image Stable Diffusion (SD) model as its backbone and repurposes it into an image-to-motion estimator. To mitigate the inconsistent outputs produced by diffusion models, we propose an Adaptive Ensemble Strategy (AES) that consolidates multiple outputs into a cohesive, high-fidelity result. Additionally, we present the concept of Sampling Steps Disaster (SSD), the counterintuitive scenario where increasing the number of sampling steps can lead to poorer outcomes, which enables our framework to achieve one-step inference. StableMotion is verified on two image rectification tasks and delivers state-of-the-art performance in both, as well as showing strong generalizability. Supported by SSD, StableMotion offers a speedup of 200 times compared to previous diffusion-model-based methods.


Beyond Wide-Angle Images: Unsupervised Video Portrait Correction via Spatiotemporal Diffusion Adaptation

Nie, Wenbo, Nie, Lang, Lin, Chunyu, Chen, Jingwen, Xing, Ke, Wang, Jiyuan, Zhao, Yao

arXiv.org Artificial Intelligence

Wide-angle cameras, despite their popularity for content creation, suffer from distortion-induced facial stretching, especially at the edge of the lens, which degrades visual appeal. To address this issue, we propose an image portrait correction framework using diffusion models, named ImagePD. It integrates the long-range awareness of transformers and the multi-step denoising of diffusion models into a unified framework, achieving global structural robustness and local detail refinement. Besides, considering the high cost of obtaining video labels, we then repurpose ImagePD for unlabeled wide-angle videos (termed VideoPD) via spatiotemporal diffusion adaptation with spatial consistency and temporal smoothness constraints. For the former, we encourage the denoised image to approximate pseudo labels following the wide-angle distortion distribution pattern, while for the latter, we derive rectification trajectories with backward optical flows and smooth them. Compared with ImagePD, VideoPD maintains high-quality facial corrections in space and mitigates potential temporal shake across frames. Finally, to establish an evaluation benchmark and train the framework, we build a video portrait dataset with large diversity in the number of people, lighting conditions, and backgrounds. Experiments demonstrate that the proposed methods outperform existing solutions quantitatively and qualitatively, contributing to high-fidelity wide-angle videos with stable and natural portraits. The codes and dataset will be available.


LuSeg: Efficient Negative and Positive Obstacles Segmentation via Contrast-Driven Multi-Modal Feature Fusion on the Lunar

Jiao, Shuaifeng, Zeng, Zhiwen, Su, Zhuoqun, Chen, Xieyuanli, Zhou, Zongtan, Lu, Huimin

arXiv.org Artificial Intelligence

As lunar exploration missions grow increasingly complex, ensuring safe and autonomous rover-based surface exploration has become one of the key challenges in lunar exploration tasks. In this work, we have developed a lunar surface simulation system called the Lunar Exploration Simulator System (LESS) and the LunarSeg dataset, which provides RGB-D data for lunar obstacle segmentation that includes both positive and negative obstacles. Additionally, we propose a novel two-stage segmentation network called LuSeg. Through contrastive learning, it enforces semantic consistency between the RGB encoder from Stage I and the depth encoder from Stage II. Experimental results on our proposed LunarSeg dataset and the public real-world NPO road obstacle dataset demonstrate that LuSeg achieves state-of-the-art segmentation performance for both positive and negative obstacles while maintaining a high inference speed of approximately 57 Hz. We have released the implementation of our LESS system, the LunarSeg dataset, and the code of LuSeg at: https://github.com/nubot-nudt/LuSeg.
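The abstract's contrastive objective pulls each RGB feature toward the depth feature of the same sample and away from those of other samples. A common way to write such a cross-modal consistency term is an InfoNCE-style loss; the sketch below shows that generic formulation with NumPy, as an assumption about the form of the objective rather than LuSeg's actual implementation (the temperature `tau` and feature shapes are invented).

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_consistency_loss(rgb_feats, depth_feats, tau=0.1):
    """InfoNCE-style loss: each RGB feature should match the depth
    feature of the same sample (positive) against the depth features
    of all other samples in the batch (negatives)."""
    n = len(rgb_feats)
    loss = 0.0
    for i in range(n):
        sims = np.array([cosine(rgb_feats[i], depth_feats[j]) / tau
                         for j in range(n)])
        # -log softmax probability of the positive pair (i, i)
        loss += -sims[i] + np.log(np.exp(sims).sum())
    return loss / n

rgb = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
aligned = contrastive_consistency_loss(rgb, rgb)        # near zero
swapped = contrastive_consistency_loss(rgb, rgb[::-1])  # large
```

Minimizing this term drives the two encoders toward a shared embedding, which is the "semantic consistency" the abstract describes.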


Multi-Fidelity Prediction and Uncertainty Quantification with Laplace Neural Operators for Parametric Partial Differential Equations

Zheng, Haoyang, Lin, Guang

arXiv.org Artificial Intelligence

Laplace Neural Operators (LNOs) have recently emerged as a promising approach in scientific machine learning due to their ability to learn nonlinear maps between function spaces. However, this framework often requires substantial amounts of high-fidelity (HF) training data, which are often prohibitively expensive to acquire. To address this, we propose multi-fidelity Laplace Neural Operators (MF-LNOs), which combine a low-fidelity (LF) base model with parallel linear/nonlinear HF correctors and dynamic inter-fidelity weighting. This allows us to exploit correlations between LF and HF datasets and achieve accurate inference of quantities of interest even with sparse HF data. We further incorporate a modified replica exchange stochastic gradient Langevin algorithm, which enables more effective posterior distribution estimation and uncertainty quantification in model predictions. Extensive validation across four canonical dynamical systems (the Lorenz system, Duffing oscillator, Burgers equation, and Brusselator reaction-diffusion system) demonstrates the framework's effectiveness. The results show significant improvements, with testing losses reduced by 40% to 80% compared to traditional approaches. This validates MF-LNO as a versatile tool for surrogate modeling in parametric PDEs, offering significant improvements in data efficiency and uncertainty-aware prediction.
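The core multi-fidelity idea, correcting a cheap low-fidelity model with a corrector fitted on sparse high-fidelity data, can be illustrated with the classic linear corrector y_HF(x) ≈ ρ·f_LF(x) + δ. This is a toy sketch of that principle only; the toy functions, sample counts, and the purely linear corrector are assumptions, not the paper's MF-LNO architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Cheap, biased low-fidelity surrogate vs. expensive high-fidelity "truth".
f_hf = lambda x: np.sin(8 * x)              # costly to sample
f_lf = lambda x: 0.8 * np.sin(8 * x) + 0.3  # correlated but biased

# Only a handful of HF samples are affordable.
x_hf = rng.uniform(0.0, 1.0, 12)
y_hf = f_hf(x_hf)

# Fit the linear corrector y_hf ≈ rho * f_lf(x) + delta by least squares.
A = np.column_stack([f_lf(x_hf), np.ones_like(x_hf)])
rho, delta = np.linalg.lstsq(A, y_hf, rcond=None)[0]

# Evaluate on a dense grid: the corrected model should beat raw LF.
x_test = np.linspace(0.0, 1.0, 200)
mf_pred = rho * f_lf(x_test) + delta
err_lf = np.mean((f_lf(x_test) - f_hf(x_test)) ** 2)
err_mf = np.mean((mf_pred - f_hf(x_test)) ** 2)
```

Here the LF/HF relation is exactly affine, so the corrector recovers ρ = 1.25 and δ = −0.375 and drives the error to essentially zero; MF-LNO replaces this linear map with learned linear/nonlinear correctors and dynamic inter-fidelity weights.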


We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Qiao, Runqi, Tan, Qiuna, Dong, Guanting, Wu, Minhui, Sun, Chong, Song, Xiaoshuai, GongQue, Zhuoma, Lei, Shanglin, Wei, Zhe, Zhang, Miaoxuan, Qiao, Runfeng, Zhang, Yifan, Zong, Xiao, Xu, Yida, Diao, Muxi, Bao, Zhimin, Li, Chen, Zhang, Honggang

arXiv.org Artificial Intelligence

Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm that the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization: they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at https://github.com/We-Math/We-Math.
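The four-dimensional metric crosses two observations per item: whether the composite problem is answered correctly, and whether its decomposed sub-problems are. A simplified reading of that decision, which is my paraphrase of the abstract rather than WE-MATH's exact scoring code, can be sketched as:

```python
def classify(composite_correct: bool, subproblems_correct: list) -> str:
    """Four-way diagnosis for one composite problem and its sub-problems
    (simplified reading of the WE-MATH IK/IG/CM/RM metric)."""
    all_subs = all(subproblems_correct)
    if composite_correct and all_subs:
        return "CM"   # Complete Mastery: parts and whole both solved
    if composite_correct and not all_subs:
        return "RM"   # Rote Memorization: solves the whole, fails the parts
    if not composite_correct and all_subs:
        return "IG"   # Inadequate Generalization: knows parts, can't compose
    return "IK"       # Insufficient Knowledge: fails the underlying concepts
```

Under this reading, the abstract's observation about GPT-4o corresponds to items shifting out of the IK cell into IG, while the "rote memorization" pattern of other LMMs lands in RM.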


Networking Systems for Video Anomaly Detection: A Tutorial and Survey

Liu, Jing, Liu, Yang, Lin, Jieyu, Li, Jielin, Sun, Peng, Hu, Bo, Song, Liang, Boukerche, Azzedine, Leung, Victor C. M.

arXiv.org Artificial Intelligence

With the widespread use of surveillance cameras in smart cities [104] and the boom of online video applications powered by 4G/5G communication technologies, traditional human inspection is no longer able to accurately monitor the video data generated around the clock; it is not only time-consuming and labor-intensive but also poses the risk of leaking important information (e.g., biometrics and sensitive speech). In contrast, VAD-empowered IoVT applications [54], such as Intelligent Video Surveillance Systems (IVSS) and automated content analysis platforms, can process massive video streams online and detect events of interest in real time, sending only noteworthy anomalous parts for human review, significantly reducing data storage and communication costs, and helping to eliminate public concerns about data security and privacy protection. As a result, VAD has gained widespread attention in academia and industry over the last decade and has been used in emerging fields such as information forensics [154] and industrial manufacturing [71] in smart cities, as well as online content analysis in mobile video applications [153]. VAD extends the data scope of conventional Anomaly Detection (AD) from time series, images, and graphs to video, which not only needs to cope with the endogenous data complexity but also needs to take into account the computational and communication costs on resource-limited devices [55]. Specifically, the inherent high-dimensional structure of video data, high information density and redundancy, heterogeneity of temporal and spatial patterns, and feature entanglement between foreground targets and background scenes make VAD more challenging than traditional AD tasks at the levels of representation learning and anomaly discrimination [89].
Existing studies [4, 60, 69, 76] have shown that high-performance VAD models need to target the modeling of appearance and motion information, i.e., the difference between regular events and anomalous examples in both spatial and temporal dimensions. In contrast to time series AD, which mainly measures periodic temporal patterns of variables, and image AD, which focuses only on spatial contextual deviations, VAD needs to extract both discriminative spatial and temporal features from a large amount of redundant information (e.g., repetitive temporal contexts and label-independent data distributions), as well as to learn the differences between normal and anomalous events in terms of their local appearances and global motions [100]. However, video anomalies are ambiguous and subjective [48].